Generalized linear models (GLMs), as the name implies, are a generalization of the linear modeling framework that allows response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances to be modeled. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLMs can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.
In the past, non-linearity and heterogeneous variances were handled by transforming the response variable, for example with log(x). However, such transformations complicate the model's interpretation because the results refer to the transformed scale (e.g. log(x)), and they are not guaranteed to achieve both normality and constant variance simultaneously. A GLM instead preserves the scale of the response and provides separate functions to transform the mean response and the variance, known as the link and variance functions respectively. So instead of looking like this:
\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)
you get this:
\(g(\mu) = \eta = \beta_{0} + \beta_{1}x\)

with \(g(\mu)\), or equivalently \(\eta\), symbolizing the link function applied to the mean response. Note that the error term drops out: in a GLM the randomness enters through the assumed distribution of the response, not through an additive \(\varepsilon\).
Another alteration of the classical linear model is that with a GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. As a result, a GLM minimizes the deviance instead of the sum of squares. However, for the Gaussian (i.e. normal) distribution the deviance and the sum of squares are equivalent.
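This equivalence is easy to verify in base R; a minimal sketch with simulated data (not the soil data used later in this chapter):

```r
# For the Gaussian family, glm() and lm() estimate identical coefficients,
# and the GLM deviance equals the OLS residual sum of squares.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian())
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
same_dev  <- isTRUE(all.equal(deviance(fit_glm), sum(resid(fit_lm)^2)))
```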
Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.
When comparing a simple linear model to a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variables. As before, the fit follows a sigmoidal curve, which keeps predictions bounded between 0 and 1.
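The logit link and its inverse are available in base R as qlogis() and plogis(); a minimal sketch showing how the inverse logit keeps values within (0, 1):

```r
# logit: eta = log(p / (1 - p)) maps probabilities onto the real line;
# the inverse logit maps any real value back into (0, 1).
p   <- c(0.1, 0.5, 0.9)
eta <- qlogis(p)     # the logit transform
p2  <- plogis(eta)   # the inverse logit recovers the probabilities
in_range <- all(plogis(c(-50, 0, 50)) >= 0 & plogis(c(-50, 0, 50)) <= 1)
```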
Examples of logistic regression output showing the probability of red clay parent material, Mollisol, and ponded components:
This example will provide a quick introduction to logistic regression by exploring the presence of soils with spodic characteristics in the Central Appalachians of West Virginia. Spodosols and soils with spodic properties form under the process of podzolization, which involves the removal (eluviation) of organic material, aluminum, and iron from the upper soil horizons (O, A, and E) and the accumulation (illuviation) of these materials in the subsoil spodic horizon(s). In this region, these soils are associated with the past and present occurrence of red spruce forest cover.
Load the required packages and set the working directory. Change the working directory to accommodate your working environment.
library(sp)
library(raster)
library(rgdal)
library(rms)
setwd("C:/workspace")
Select the sample file, create a data frame, and view the column names
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/wv_transect_editedforR.csv"
pts <- read.csv(githubURL)
names(pts)
## [1] "x" "y" "Overtype" "Underconifer"
## [5] "Oi" "Oe" "Oa" "Otot"
## [9] "epipedon" "spodint" "subgroup" "order"
## [13] "ps" "drainage" "series" "taxon"
## [17] "slope" "surfacetex" "stoniness" "depthclass"
## [21] "bedrockdepth" "depth_cm" "hillslope" "tipmound"
## [25] "rainfall" "geology" "aachn" "dem10m"
## [29] "downslpgra" "eastness" "greenrefl" "landsatb1"
## [33] "landsatb2" "landsatb3" "landsatb7" "maxc100"
## [37] "maxent" "minc100" "mirref" "ndvi"
## [41] "northeastn" "northness" "northwestn" "planc100"
## [45] "proc100" "protection" "relpos11" "slp50"
## [49] "solar" "tanc75"
Soil scientists developed a local, ad-hoc classification of the intensity of spodic expression for use in the region that is an ordered data type. This will need to be converted to a binary classification for modeling purposes.
Add a column called spod_pres_cons to the pts object that converts spodint to a binary variable
pts$spod_pres_cons <- ifelse(pts$spodint <= 1, 0, 1)
Create a model using dem10m, eastness, northness, and maxent to predict spod_pres_cons and view the summary
GLM.1 <- lrm(spod_pres_cons ~ dem10m + eastness + northness + maxent, data=pts)
print(GLM.1)
## Logistic Regression Model
##
## lrm(formula = spod_pres_cons ~ dem10m + eastness + northness +
## maxent, data = pts)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 52.63 R2 0.268 C 0.773
## 0 174 d.f. 4 g 1.339 Dxy 0.546
## 1 76 Pr(> chi2) <0.0001 gr 3.817 gamma 0.547
## max |deriv| 1e-09 gp 0.234 tau-a 0.232
## Brier 0.167
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 7.4505 2.3284 3.20 0.0014
## dem10m -0.0090 0.0023 -3.82 0.0001
## eastness -0.8647 0.2563 -3.37 0.0007
## northness 0.6756 0.2310 2.93 0.0034
## maxent 0.0315 0.0086 3.67 0.0002
##
The summary will look similar to the following:
Summary output
Evaluating the results involves reviewing several key values, as noted in the summary figure:
Run a prediction of the model using the dem10m, eastness, northness, and maxent raster files. These files will be downloaded, a raster stack will be built and the GLM.1 model applied. When using your own data, the “stack” code only works if all rasters are co-registered, are .img files, have the same projection and spatial extent, and are stored in your working directory. In practice, other GDAL file formats should also work.
# githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/wv_raster.RData"
# load(url(githubURL))
load("C:/workspace/wv_raster.RData")
# glm_r <- predict(rasters, GLM.1, filename = "spodic_pres_GLM1.img", type = "fitted", progress = "text", overwrite = TRUE)
glm_r <- raster("C:/workspace/spodic_pres_GLM1.img")
plot(glm_r)
Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real example, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP) (i.e. CA794) in the Mojave Desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are stable landforms left over from the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.
With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data come from the USGS National Elevation Dataset (NED), which provides considerably less detail than, say, LiDAR or IfSAR data (Shi et al., 2012). Lastly, our pedon dataset, like most in NASIS, hasn’t received nearly as much quality control as the components have, so we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try to model individual soil series with argillic horizons, but due to some of the challenges previously mentioned that would be difficult with this dataset. However, at the end we’ll look at one simple approach to try to separate individual soil series with argillic horizons.
To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.
library(aqp) # specialized soil classes and functions
library(soilDB) # NASIS and SDA import functions
library(raster) # guess
library(rgdal) # spatial import
library(ggplot2) # graphing
library(reshape2) # data manipulation
library(caret) # classification and regression training
library(car) # additional regression tools
Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist.
# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))
str(pedons, max.level = 2) # Examine the makeup of the data we imported from NASIS.
## Formal class 'SoilProfileCollection' [package "aqp"] with 7 slots
## ..@ idcol : chr "peiid"
## ..@ depthcols : chr [1:2] "hzdept" "hzdepb"
## ..@ metadata :'data.frame': 1 obs. of 1 variable:
## ..@ horizons :'data.frame': 4990 obs. of 43 variables:
## ..@ site :'data.frame': 1168 obs. of 79 variables:
## ..@ sp :Formal class 'SpatialPoints' [package "sp"] with 3 slots
## ..@ diagnostic:'data.frame': 2133 obs. of 4 variables:
Generally, before you begin modeling, you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that not all the argillic horizons have been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always show up in the taxonomic name, so we can use pattern matching to extract them. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes, we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.
# Check consistency of argillic horizon population
s <- site(pedons) # get the site table
table(s$argillic.horizon, useNA = "ifany") # tabulate the number of argillic horizons observed
##
## FALSE TRUE <NA>
## 750 272 146
# or
# summary(s$argillic.horizon)
# Extract argillic presence from the taxonomic subgroup
s$argillic <- grepl("arg", s$tax_subgroup)
table(s$argillic, useNA = "ifany")
##
## FALSE TRUE
## 886 282
Ideally, if the diagnostic horizon table had been populated consistently, we could have used the upper depth of the diagnostic feature to filter out argillic horizons that start below 50 cm, which may not be representative of “good” argillic horizons and may therefore have been correlated to Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with the pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.
d <- diagnostic_hz(pedons)
peiid <- d[d$diag_kind == "argillic horizon" & d$featdept < 50, "peiid"]
test <- s$peiid %in% unique(peiid)
summary(test)
## Mode FALSE TRUE NA's
## logical 940 228 0
Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).
# Landform vs argillic presence
# Subset
s_sub <- subset(s, argillic == TRUE)
# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub,
table(landform.string, argillic, useNA = "ifany")
)
# Subset and print landform.string with > 3 observations
test[test > 3,]
## alluvial fan fan apron fan apron & fan remnant
## 9 27 25
## fan remnant hill hillslope
## 109 15 32
## low hill mountain mountain slope
## 5 6 8
## pediment <NA>
## 11 6
# generalize the landform.string
s$landform <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform.string), "fan", "hill")
Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants, as was alluded to earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.
# Hillslope position
# Subset fan landforms
s_sub <- subset(s, landform == "fan")
# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
prop.table(table(hillslope_pos, argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## hillslope_pos FALSE TRUE
## Toeslope 18 5
## Footslope 2 2
## Backslope 15 16
## Shoulder 3 4
## Summit 15 32
## <NA> 47 40
# Slope shape
with(s_sub, round(
prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## FALSE TRUE
## Concave Concave 1 1
## Concave Convex 0 0
## Concave Linear 4 1
## Convex Concave 0 0
## Convex Convex 8 8
## Convex Linear 7 7
## Linear Concave 7 3
## Linear Convex 20 29
## Linear Linear 41 45
## Linear NA 0 0
## NA NA 12 8
Looking at the hillslope position of fan landforms we can see that a slightly higher proportion of argillic horizons are found on summits, while fewer are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.
# Surface morphometry, depth and surface rock fragments
# Recalculate gravel
s$surface_gravel <- with(s,
surface_gravel - surface_fgravel
)
# Calculate the total surface rock fragments
s$frags <- apply(s[grepl("surface", names(s))], 1, sum)
# Subset to just look at fans, and select numeric columns
s_sub <- subset(s, landform == "fan", select = c(argillic, bedrckdepth, slope_field, elev_field, frags))
s_m <- melt(s_sub, id = "argillic") # convert s_sub to long data format
head(s_m, 2)
## argillic variable value
## 1 FALSE bedrckdepth NA
## 2 FALSE bedrckdepth 11
ggplot(s_m, aes(x = argillic, y = value)) +
geom_boxplot() +
facet_wrap(~ variable, scale = "free")
Looking at our numeric variables, only depth to bedrock seems to show much separation between the presence/absence of argillic horizons.
Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the large number of soil scientists who have worked on CA794, including detailees, we’ve filtered the names of the soil scientists to include just the top 3 mappers, giving priority to the most senior soil scientists when they occur together.
# Custom function to filter out the top 3 soil scientists
desc_test <- function(old) {
  old <- as.character(old)
  # grepl() on NA returns NA, which would break the if() tests below,
  # so handle missing names up front
  if (is.na(old)) return("other")
  new <- NA
  # ranked by seniority
  if (grepl("Stephen", old)) {new <- "Stephen"} # least senior
  if (grepl("Paul", old)) {new <- "Paul"}
  if (grepl("Peter", old)) {new <- "Peter"} # most senior
  if (is.na(new)) {new <- "other"}
  return(new)
}
s$describer2 <- sapply(s$describer, desc_test)
s_sub <- subset(s, landform == "fan")
# By frequency
with(s_sub, table(describer2, argillic, useNA = "ifany"))
## argillic
## describer2 FALSE TRUE
## other 134 58
## Paul 64 28
## Peter 208 86
## Stephen 58 19
# By proportion
with(s_sub, round(
prop.table(table(describer2, argillic), margin = 1)
* 100)
)
## argillic
## describer2 FALSE TRUE
## other 70 30
## Paul 70 30
## Peter 71 29
## Stephen 75 25
For fan landforms, none of the soil scientists seem more likely than the others to describe argillic horizons. However, while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.
Where do our points plot? We can plot the general location in R, but for this task we will export them to a Shapefile, so we can view them in a proper GIS, and really inspect them. Notice in the figure below the number of points that fall outside the survey boundary. What it doesn’t show is the points that may plot in the Ocean or Mexico!
# Convert soil profile collection to a spatial object
pedons2 <- pedons
slot(pedons2, "site") <- s # this is dangerous, but something needs to be fixed in the site() setter function
idx <- complete.cases(site(pedons2)[c("x", "y")]) # create an index to filter out pedons with missing coordinates
pedons2 <- pedons2[idx]
coordinates(pedons2) <- ~ x + y # set the coordinates
proj4string(pedons2) <- CRS("+init=epsg:4326") # set the projection
pedons_sp <- as(pedons2, "SpatialPointsDataFrame") # coerce to spatial object
pedons_sp <- spTransform(pedons_sp, CRS("+init=epsg:5070")) # reproject
# Read in soil survey area boundaries
# ssa <- readOGR(dsn = "F:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs")
# ca794 <- subset(ssa, areasymbol == "CA794") # subset out Joshua Tree National Park
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070"))
# Plot
plot(ca794, axes = TRUE)
plot(pedons_sp, add = TRUE) # notice the points outside the boundary
# Write shapefile of pedons
# writeOGR(pedons_sp, dsn = "F:/geodata/project_data/8VIC", "pedon_locations", driver = "ESRI Shapefile")
Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself, a difficult task, and should be facilitated by your Regional GIS Specialist. Typically, these geodata files would primarily consist of derivatives of a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files have the same projection, extent, and cell size. Once we have the necessary files we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with the field data.
folder <- "F:/geodata/project_data/8VIC/ca794/"
files <- c(
elev = "ned30m_8VIC.tif", # elevation
slope = "ned30m_8VIC_slope5.tif", # slope gradient
aspect = "ned30m_8VIC_aspect5.tif", # slope aspect
twi = "ned30m_8VIC_wetness.tif", # topographic wetness index
twi_sc = "ned30m_8VIC_wetness_sc.tif", # transformed twi
ch = "ned30m_8VIC_cheight.tif", # catchment height
z2str = "ned30m_8VIC_z2stream.tif", # height above streams
mrrtf = "ned30m_8VIC_mrrtf.tif", # multiresolution ridgetop flatness index
mrvbf = "ned30m_8VIC_mrvbf.tif", # multiresolution valley bottom flatness index
solar = "ned30m_8VIC_solar.tif", # solar radiation
precip = "prism30m_8VIC_ppt_1981_2010_annual_mm.tif", # annual precipitation
precipsum = "prism30m_8VIC_ppt_1981_2010_summer_mm.tif", # summer precipitation
temp = "prism30m_8VIC_tmean_1981_2010_annual_C.tif", # annual temperature
ls = "landsat30m_8VIC_b123457.tif", # landsat bands
pc = "landsat30m_8VIC_pc123456.tif", # principal components of landsat
tc = "landsat30m_8VIC_tc123.tif", # tasseled cap components of landsat
k = "gamma30m_8VIC_namrad_k.tif", # gamma radiometrics signatures
th = "gamma30m_8VIC_namrad_th.tif",
u = "gamma30m_8VIC_namrad_u.tif",
cluster = "cluster152.tif" # unsupervised classification
)
# combine the folder directory and file names
geodata_f <- paste0(folder, files)
# Create a raster stack
geodata_r <- stack(geodata_f)
# Extract the geodata and imbed in a data frame
data <- data.frame(
as.data.frame(pedons_sp)[c("pedon_id", "taxonname", "frags", "x_std", "y_std", "describer2", "landform.string", "landform", "tax_subgroup")],
extract(geodata_r, pedons_sp)
)
# Modify some of the geodata variables
data$mast <- data$temp - 4
idx <- aggregate(mast ~ cluster, data = data, function(x) round(mean(x, na.rm = TRUE), 2))
names(idx)[2] <- "cluster_mast"
data <- merge(data, idx, by = "cluster", all.x = TRUE)
data$cluster <- factor(data$cluster, levels = 1:15)
data$cluster2 <- reorder(data$cluster, data$cluster_mast)
data$gsi <- with(data, (ls_3 - ls_1) / (ls_3 + ls_2 + ls_1))
data$ndvi <- with(data, (ls_4 - ls_3) / (ls_4 + ls_3))
data$sw <- cos((data$aspect - 255) * pi / 180) # southwestness; convert degrees to radians for cos()
# save(data, ca794, pedons, file = "C:/workspace/ch7_data.Rdata")
# Strip out location and personal information before uploading to the internet
# s[c("describer", "describer2", "x", "y", "x_std", "y_std", "utmnorthing", "utmeasting", "classifier")] <- NA
# slot(pedons, "site") <- s
# data[c("describer2", "x_std", "y_std")] <- NA
# save(data, ca794, pedons, file = "C:/workspace/stats_for_soil_survey/trunk/data/ch7_data.Rdata")
With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).
# Load data
load(file = "C:/workspace/ch7_data.Rdata")
train <- data
# Select argillic horizons with "arg" in the subgroup name and on fans
# Argillic horizons that occur on hills and mountains more than likely form by a different process, and therefore would require a different model.
train$argillic <- ifelse(grepl("arg", train$tax_subgroup) &
train$mrvbf > 0.15,
TRUE, FALSE
)
train <- subset(train, !is.na(argillic), select = - c(pedon_id, taxonname, x_std, y_std, landform.string, cluster, cluster_mast, argillic.horizon, tax_subgroup, frags))
train2 <- subset(train, select = - c(describer2, landform, cluster2))
data_m <- melt(train2, id = "argillic")
ggplot(data_m, aes(x = argillic, y = value)) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free")
# Transform twi, as argillic horizons seem to occur over a limited range
aggregate(twi ~ argillic, data = train, summary)
## argillic twi.Min. twi.1st Qu. twi.Median twi.Mean twi.3rd Qu. twi.Max.
## 1 FALSE 8.131 9.240 11.300 12.100 14.440 24.440
## 2 TRUE 10.070 12.800 13.830 13.740 14.690 19.470
train$twi_sc <- abs(train$twi - 13.8)
Modeling is an iterative process that cycles between fitting and evaluating alternative models. Compared to tree and forest models, linear and generalized linear models require more input from the user. Automated model selection procedures are available, but they are discouraged because they generally result in complex and unstable models. This is partly due to correlation amongst the predictive variables, which can confuse the model. In addition, the order in which the variables are included in or excluded from the model affects the significance of the others, so several weak predictors might mask the effect of one strong predictor. Therefore, it is best to begin with a selection of predictors that are known to be useful, and grow the model incrementally.
The example below is known as a forward selection procedure: a full model is fit and compared against a null model to assess the effect of the different predictors. For comparing alternative models, Akaike’s Information Criterion (AIC) is used. When using AIC to assess predictor significance, a smaller value is better.
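To make the criterion concrete, here is a minimal sketch on R’s built-in mtcars data (purely illustrative, not the soil data) verifying that AIC = -2 log-likelihood + 2k, where k is the number of estimated parameters:

```r
# Fit a small logistic model and recompute its AIC from the definition.
fit <- glm(am ~ wt, data = mtcars, family = binomial())
k   <- attr(logLik(fit), "df")  # number of estimated parameters
aic_by_hand <- -2 * as.numeric(logLik(fit)) + 2 * k
same_aic <- isTRUE(all.equal(aic_by_hand, AIC(fit)))
```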
full <- glm(argillic ~ ., data = train, family = binomial()) # "~ ." includes all columns in the data set
null <- glm(argillic ~ 1, data = train, family = binomial()) # "~ 1" just includes an intercept
add1(null, full, test = "Chisq") # using the AIC test the effect of adding additional predictors, generally select the predictor with the smallest AIC unless it goes against your intuition
## Single term additions
##
## Model:
## argillic ~ 1
## Df Deviance AIC LRT Pr(>Chi)
## <none> 872.18 891.63
## describer2 3 853.25 878.70 18.932 0.0002824 ***
## landform 1 768.81 790.26 103.377 < 2.2e-16 ***
## elev 1 869.79 891.24 2.398 0.1215117
## slope 1 704.62 726.07 167.568 < 2.2e-16 ***
## aspect 1 869.85 891.30 2.328 0.1270350
## twi 1 837.20 858.65 34.983 3.327e-09 ***
## twi_sc 1 688.26 709.71 183.918 < 2.2e-16 ***
## ch 1 870.81 892.26 1.370 0.2418957
## z2str 1 805.75 827.20 66.429 3.628e-16 ***
## mrrtf 1 823.82 845.27 48.365 3.538e-12 ***
## mrvbf 1 819.41 840.86 52.775 3.740e-13 ***
## solar 1 861.83 883.28 10.353 0.0012926 **
## precip 1 864.20 885.65 7.986 0.0047148 **
## precipsum 1 854.28 875.73 17.901 2.327e-05 ***
## temp 1 870.08 891.53 2.102 0.1470626
## ls_1 1 871.93 893.38 0.251 0.6163244
## ls_2 1 869.79 891.24 2.390 0.1220857
## ls_3 1 864.88 886.33 7.300 0.0068948 **
## ls_4 1 860.60 882.05 11.579 0.0006670 ***
## ls_5 1 846.15 867.60 26.034 3.355e-07 ***
## ls_6 1 847.19 868.64 24.997 5.741e-07 ***
## pc_1 1 855.78 877.23 16.399 5.132e-05 ***
## pc_2 1 836.95 858.40 35.233 2.926e-09 ***
## pc_3 1 858.12 879.57 14.061 0.0001769 ***
## pc_4 1 869.35 890.80 2.828 0.0926066 .
## pc_5 1 872.17 893.62 0.015 0.9017456
## pc_6 1 863.85 885.30 8.338 0.0038833 **
## tc_1 1 861.58 883.03 10.606 0.0011273 **
## tc_2 1 869.32 890.77 2.863 0.0906327 .
## tc_3 1 836.92 858.37 35.260 2.886e-09 ***
## k 1 870.13 891.58 2.051 0.1521487
## th 1 871.26 892.71 0.920 0.3374110
## u 1 872.02 893.47 0.162 0.6877600
## mast 1 870.08 891.53 2.102 0.1470626
## cluster2 12 760.01 803.46 112.172 < 2.2e-16 ***
## gsi 1 835.26 856.71 36.921 1.230e-09 ***
## ndvi 1 868.77 890.22 3.409 0.0648437 .
## sw 1 872.16 893.61 0.023 0.8802474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see, as the boxplots showed earlier, that twi_sc has the smallest AIC and reduces the deviance the most. So let’s add twi_sc to the null model using the update() function. Then continue using the add1() or drop1() functions until the model is saturated.
argi_glm <- update(null, . ~ . + twi_sc) # add twi_sc to the model, "-" will subtract predictors
# or refit
# argi_glm <- glm(argillic ~ twi_sc, data = train, family = binomial(link = "cloglog"))
# add1(argi_glm, full, test = "Chisq") # iterate until the model is saturated
# drop1(argi_glm, test = "Chisq") # test effect of dropping a predictor
argi_glm <- glm(argillic ~ twi_sc + slope + ls_1 + ch + z2str + mrvbf, data = train, family = binomial())
summary(argi_glm) # examine the effect and error for each predictors
##
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str +
## mrvbf, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.55921 -0.56704 -0.11035 -0.00256 2.90359
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.069968 0.815232 4.992 5.96e-07 ***
## twi_sc -0.639211 0.097629 -6.547 5.86e-11 ***
## slope -0.247058 0.053860 -4.587 4.49e-06 ***
## ls_1 -0.025050 0.009370 -2.674 0.00751 **
## ch -0.003184 0.001180 -2.698 0.00698 **
## z2str -0.020988 0.006048 -3.470 0.00052 ***
## mrvbf -0.311831 0.134845 -2.313 0.02075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 875.83 on 989 degrees of freedom
## Residual deviance: 599.25 on 983 degrees of freedom
## (40 observations deleted due to missingness)
## AIC: 613.25
##
## Number of Fisher Scoring iterations: 8
After the model is saturated you should end up with a model similar to the one above.
After we’re satisfied that no additional variables will improve the fit, we need to evaluate its residuals, collinearity, accuracy, and model coefficients.
# Standard diagnostic plots for glm() objects
# plot(argi_glm) # plot regression diagnostics
# Term and partial residual plots
# termplot(argi_glm, partial.resid = TRUE)
The variance inflation factor (VIF) is used to assess collinearity amongst the predictors. Its square root indicates the factor by which a predictor coefficient’s standard error is inflated: a square root of 2 means the standard error is doubled. Rules of thumb vary, but a square root of the VIF greater than 2 or 3 is generally considered unacceptable.
# Variance inflation, greater than 5 or 10 is bad
vif(argi_glm)
## twi_sc slope ls_1 ch z2str mrvbf
## 1.098068 1.681115 1.171283 1.156552 1.373327 1.892782
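For the curious, the VIF can be computed by hand: for each predictor it is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch on the built-in mtcars data (illustrative only; the car package’s vif() automates this for every term):

```r
# VIF for wt in a model containing wt, hp, and disp as predictors
r2     <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2)  # > 1; larger values mean more collinearity
```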
Because we’re dealing with a classification problem, we have to consider both errors of omission (false negatives) and errors of commission (false positives), or their corresponding accuracies: sensitivity (the true positive rate, also called producer’s accuracy) and specificity (the true negative rate), respectively. Before we can assess the error, however, we need to select a probability threshold.
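Sensitivity and specificity can be computed directly from a 2 x 2 table of observed vs predicted classes; a minimal sketch with hypothetical values (not the soil data):

```r
# sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
obs  <- c(TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE)
pred <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
tab  <- table(pred = pred, obs = obs)
sens <- tab["TRUE",  "TRUE"]  / sum(tab[, "TRUE"])   # true positive rate
spec <- tab["FALSE", "FALSE"] / sum(tab[, "FALSE"])  # true negative rate
```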
comp <- cbind(train[c("argillic", "cluster2")], pred = predict(argi_glm, train, type = "response") > 0.5)
confusionMatrix(comp$pred, comp$argillic, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 802 113
## TRUE 28 47
##
## Accuracy : 0.8576
## 95% CI : (0.8342, 0.8788)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.05332
##
## Kappa : 0.331
## Mcnemar's Test P-Value : 1.504e-12
##
## Sensitivity : 0.29375
## Specificity : 0.96627
## Pos Pred Value : 0.62667
## Neg Pred Value : 0.87650
## Prevalence : 0.16162
## Detection Rate : 0.04747
## Detection Prevalence : 0.07576
## Balanced Accuracy : 0.63001
##
## 'Positive' Class : TRUE
##
# D-squared: the proportion of deviance explained, the GLM analogue of R-squared
D2 <- with(argi_glm, round((null.deviance - deviance) / null.deviance, 2))
# Adjusted D-squared, penalized for the number of predictors
adjD2 <- with(argi_glm, round(1 - ((df.null / df.residual) * (1 - D2)), 2))
adjD2
## [1] 0.32
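The D² statistic computed above is the proportion of deviance explained, the GLM analogue of R². A minimal sketch of the same calculation on the built-in mtcars data (illustrative only):

```r
# D2 = (null deviance - residual deviance) / null deviance, in [0, 1]
fit <- glm(am ~ wt, data = mtcars, family = binomial())
D2  <- with(fit, (null.deviance - deviance) / null.deviance)
```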
comp_sub <- subset(comp, argillic == TRUE)
temp <- by(comp_sub, list(comp_sub$cluster), function(x) with(x, data.frame(
cluster = unique(cluster2),
sum_arg = sum(argillic, na.rm = T),
sum_pred = sum(pred, na.rm = T),
sensitivity = round(sum(pred == argillic) / length(argillic), 2)
)))
temp <- do.call(rbind, temp)
temp
## cluster sum_arg sum_pred sensitivity
## 5 5 32 14 0.44
## 15 15 27 6 0.22
## 14 14 20 11 0.55
## 12 12 1 0 0.00
## 2 2 32 3 0.09
## 13 13 8 1 0.12
## 3 3 13 6 0.46
## 6 6 5 1 0.20
## 7 7 17 2 0.12
## 8 8 5 3 0.60
ggplot(temp, aes(x = cluster, y = sensitivity)) +
geom_point()
# Remove hyperthermic points (mast = temp - 4, so temp < 26 keeps mast below 22)
train_sub <- subset(train, temp < 22 + 4)
# full <- glm(argillic ~ ., data = train_sub, family = binomial(link = "cloglog"))
# null <- glm(argillic ~ 1, data = train_sub, family = binomial(link = "cloglog"))
# add1(null, full, train = "Chisq")
sub_glm <- glm(argillic ~ slope + twi_sc + ls_1 + mrvbf + z2str + ch, data = train_sub, family = binomial())
# summary(sub_glm)
comp <- cbind(train_sub[c("argillic", "cluster2")], pred = predict(sub_glm, train_sub, type = "response") > 0.4)
confusionMatrix(comp$pred, comp$argillic, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 375 38
## TRUE 53 60
##
## Accuracy : 0.827
## 95% CI : (0.7919, 0.8584)
## No Information Rate : 0.8137
## P-Value [Acc > NIR] : 0.2350
##
## Kappa : 0.4612
## Mcnemar's Test P-Value : 0.1422
##
## Sensitivity : 0.6122
## Specificity : 0.8762
## Pos Pred Value : 0.5310
## Neg Pred Value : 0.9080
## Prevalence : 0.1863
## Detection Rate : 0.1141
## Detection Prevalence : 0.2148
## Balanced Accuracy : 0.7442
##
## 'Positive' Class : TRUE
##
comp_sub <- subset(comp, argillic == TRUE)
temp <- by(comp_sub, list(comp_sub$cluster2), function(x) with(x, data.frame(
cluster = unique(cluster2),
sum_arg = sum(argillic, na.rm = T),
sum_pred = sum(pred, na.rm = T),
sensitivity = round(sum(pred == argillic) / length(argillic), 2))))
temp <- do.call(rbind, temp)
temp
## cluster sum_arg sum_pred sensitivity
## 5 5 32 19 0.59
## 15 15 27 17 0.63
## 14 14 20 16 0.80
## 12 12 1 0 0.00
## 2 2 17 8 0.47
## 13 13 1 0 0.00
ggplot(temp, aes(x = cluster, y = sensitivity)) +
geom_point()
# Examine the coefficients
summary(argi_glm)
##
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str +
## mrvbf, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.55921 -0.56704 -0.11035 -0.00256 2.90359
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.069968 0.815232 4.992 5.96e-07 ***
## twi_sc -0.639211 0.097629 -6.547 5.86e-11 ***
## slope -0.247058 0.053860 -4.587 4.49e-06 ***
## ls_1 -0.025050 0.009370 -2.674 0.00751 **
## ch -0.003184 0.001180 -2.698 0.00698 **
## z2str -0.020988 0.006048 -3.470 0.00052 ***
## mrvbf -0.311831 0.134845 -2.313 0.02075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 875.83 on 989 degrees of freedom
## Residual deviance: 599.25 on 983 degrees of freedom
## (40 observations deleted due to missingness)
## AIC: 613.25
##
## Number of Fisher Scoring iterations: 8
# Transform the coefficients back to the probability scale using the
# inverse link function
round(binomial(link = "logit")$linkinv(coef(argi_glm)), 2)
## (Intercept) twi_sc slope ls_1 ch z2str
## 0.98 0.35 0.44 0.49 0.50 0.49
## mrvbf
## 0.42
# Importance of each predictor assessed by the amount of deviance they explain
anova(argi_glm)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: argillic
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 989 875.83
## twi_sc 1 185.455 988 690.38
## slope 1 50.802 987 639.58
## ls_1 1 19.950 986 619.63
## ch 1 4.694 985 614.93
## z2str 1 10.116 984 604.82
## mrvbf 1 5.561 983 599.25
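To put the deviance contributions on a comparable footing, each term can be expressed as a fraction of the null deviance. A sketch, again assuming the `argi_glm` fit above; note the values depend on the order in which terms were added:

```r
# Share of the null deviance explained by each term, in the order fitted
dev <- anova(argi_glm)
round(setNames(dev$Deviance[-1] / dev[["Resid. Dev"]][1], rownames(dev)[-1]), 3)
```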
# Custom function to return the predictions and their standard errors
predfun <- function(model, data) {
  v <- predict(model, data, type = "response", se.fit = TRUE)
  cbind(
    p  = as.vector(v$fit),
    se = as.vector(v$se.fit)
  )
}
# Generate spatial predictions
# r <- predict(geodata_r, argi_glm, fun = predfun, index = 1:2, progress = "text")
# Export the results
# writeRaster(r[[1]], "argi.tif", overwrite = TRUE, progress = "text")
# writeRaster(r[[2]], "argi_se.tif", overwrite = TRUE, progress = "text")
plot(raster("C:/workspace/argi.tif"))
plot(ca794, add = TRUE)
plot(raster("C:/workspace/argi_se.tif"))
plot(ca794, add = TRUE)